NrichD database: sequence databases enriched with computationally designed protein-like sequences aid in remote homology detection
Abstract
NrichD (http://proline.biochem.iisc.ernet.in/NRICHD/) is a database of computationally designed protein-like sequences. These sequences are added to natural sequence databases, enabling hops in protein sequence space that assist in the detection of remote relationships. Establishing protein relationships in the absence of structural evidence or natural 'intermediately related sequences' is a challenging task. We recently demonstrated that the computational design of artificial intermediary sequences (linkers) is an effective approach to filling naturally occurring voids in protein sequence space. Through a large-scale assessment, we also demonstrated that such sequences can be plugged into commonly employed search databases to improve the performance of routinely used sequence search methods in detecting remote relationships. Because such data sets are expected to be used to establish protein relationships, two databases that already capture these relationships at the structural and functional domain level, namely the SCOP database and the Pfam database, have been 'enriched' with these artificial intermediary sequences. The NrichD database currently contains 3,611,010 artificial sequences generated between 27,882 pairs of families from 374 SCOP folds. The data sets are freely available for download. An additional feature is the design of artificial sequences between any two protein families of interest to the user.
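As a rough illustration of what "plugging" designed sequences into a search database could look like, the sketch below merges a natural sequence set with a set of artificial sequences and tags the artificial headers so that hits on them can be recognised as intermediates. It is a minimal sketch, not the NrichD pipeline: it assumes both sets are available as FASTA text, and the function names (`parse_fasta`, `enrich`, `to_fasta`) and the `ARTIFICIAL|` header tag are hypothetical choices made for this example.

```python
from typing import Iterable, List, Tuple

Record = Tuple[str, str]  # (header without '>', sequence)


def parse_fasta(text: str) -> List[Record]:
    """Parse FASTA-formatted text into (header, sequence) pairs."""
    records: List[Record] = []
    header = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def enrich(natural: Iterable[Record],
           artificial: Iterable[Record],
           tag: str = "ARTIFICIAL") -> List[Record]:
    """Append designed sequences to the natural set.

    The tag prefix is a hypothetical convention so that hits on
    designed sequences can be told apart from natural ones.
    """
    merged = list(natural)
    merged.extend((f"{tag}|{header}", seq) for header, seq in artificial)
    return merged


def to_fasta(records: Iterable[Record], width: int = 60) -> str:
    """Serialise records back to FASTA text, wrapping sequence lines."""
    lines: List[str] = []
    for header, seq in records:
        lines.append(f">{header}")
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    natural = parse_fasta(">fam1_seq1\nMKT\nAYI\n>fam2_seq1\nGSH\n")
    artificial = parse_fasta(">design_1\nMKSAYL\n")
    print(to_fasta(enrich(natural, artificial)), end="")
```

The merged FASTA text would still need to be indexed by whichever search tool is used; that step is tool-specific and is left out here.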
Similar resources
REMOTE HOMOLOGY DETECTION WITH HMMs AND STRUCTURAL ISSUES
Computational methods for homology detection between protein sequences have become a central component in genome analysis. Nowadays, sequences of unknown function are routinely searched against databases of known proteins, providing an important aid for sequence annotation and for guiding laboratory experiments. Although homology identification through pairwise sequence matching [1, 2] is still...
Cascaded walks in protein sequence space: use of artificial sequences in remote homology detection between natural proteins.
Over the past two decades, many ingenious efforts have been made in protein remote homology detection. Because homologous proteins often diversify extensively in sequence, it is challenging to demonstrate such relatedness through entirely sequence-driven searches. Here, we describe a computational method for the generation of 'protein-like' sequences that serves to bridge gaps in protein sequen...
Fast and accurate semi-supervised protein homology detection with large uncurated sequence databases
Establishing structural and functional relationship between sequences in the presence of only the primary sequence information is a key task in biological sequence analysis. This ability is critical for tasks such as inferring the superfamily membership of unannotated proteins (remote homology detection) when no secondary or tertiary structure is available. Recent methods such as profile kernel...
A Dihedral Angle Database of Short Sub-sequences for Protein Structure Prediction
Protein structure prediction is considered to be the holy grail of bioinformatics. Ab initio and homology modelling are two important groups of methods used in protein structure prediction. Amongst these, ab initio methods assume that no previous knowledge about protein structures is required. On the other hand homology modelling is based on sequence similarity and uses information such as clas...
Protein family classification and functional annotation
With the accelerated accumulation of genomic sequence data, there is a pressing need to develop computational methods and advanced bioinformatics infrastructure for reliable and large-scale protein annotation and biological knowledge discovery. The Protein Information Resource (PIR) provides an integrated public resource of protein informatics to support genomic and proteomic research. PIR prod...